What drives the price of a car?¶

OVERVIEW

In this application, you will explore a dataset from Kaggle. The original dataset contained information on 3 million used cars; the provided subset contains information on 426K cars to ensure speed of processing. Your goal is to understand what factors make a car more or less expensive. As a result of your analysis, you should provide clear recommendations to your client -- a used car dealership -- as to what consumers value in a used car.

CRISP-DM Framework¶

[Figure: CRISP-DM process diagram]

To frame the task, throughout our practical applications we will refer back to a standard process in industry for data projects called CRISP-DM. This process provides a framework for working through a data problem. Your first step in this application will be to read through a brief overview of CRISP-DM here. After reading the overview, answer the questions below.

Business Understanding¶

From a business perspective, we are tasked with identifying key drivers for used car prices. In the CRISP-DM overview, we are asked to convert this business framing to a data problem definition. Using a few sentences, reframe the task as a data task with the appropriate technical vocabulary.

The business objective is to identify the critical attributes in the vehicle dataset and build a model that predicts price when those key attributes are passed to it. Knowing which car features matter most will help salespeople sell used cars to buyers. To achieve this, the vehicle dataset was downloaded from Kaggle, which gives insight into used car attributes.

The success criteria for this project are to identify the key features and to create a functional model that car dealers can use to drive used car sales.

This project will use multiple open-source Python packages, APIs, and resources to develop the model.

Data Understanding¶

After considering the business understanding, we want to get familiar with our data. Write down some steps that you would take to get to know the dataset and identify any quality issues within. Take time to get to know the dataset and explore what information it contains and how this could be used to inform your business understanding.

In this section, the focus is on:

  1. Loading the vehicle dataset for analysis and model generation.
  2. Analyzing the numerical and categorical features and performing quality checks on the attributes.
  3. Identifying incomplete data and attributes that are irrelevant due to missing data and can be dropped.
In [7]:
#### Initial Data Collection ####
# The used-vehicle data is downloaded from Kaggle and contains attributes of used cars that were sold.
# The goal is to find the attributes that help a salesperson identify what customers prefer in used cars.

## Import the Python libraries ##
import numpy as np
import numpy.ma as ma
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
from concurrent.futures import ThreadPoolExecutor

from category_encoders import TargetEncoder
from sklearn import set_config
from sklearn.compose import ColumnTransformer, make_column_selector, make_column_transformer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.feature_selection import SelectFromModel, SequentialFeatureSelector
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import (LabelEncoder, OneHotEncoder, OrdinalEncoder,
                                   PolynomialFeatures, StandardScaler)

set_config(display="diagram")


## Load the CSV file:
df = pd.read_csv("data/vehicles.csv")

## Print a sample of the dataset
df.head()
Out[7]:
id region price year manufacturer model condition cylinders fuel odometer title_status transmission VIN drive size type paint_color state
0 7222695916 prescott 6000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN az
1 7218891961 fayetteville 11900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ar
2 7221797935 florida keys 21000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN fl
3 7222270760 worcester / central MA 1500 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ma
4 7210384030 greensboro 4900 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN nc
In [8]:
## Description of dataset:
df.describe()
Out[8]:
id price year odometer
count 4.268800e+05 4.268800e+05 425675.000000 4.224800e+05
mean 7.311487e+09 7.519903e+04 2011.235191 9.804333e+04
std 4.473170e+06 1.218228e+07 9.452120 2.138815e+05
min 7.207408e+09 0.000000e+00 1900.000000 0.000000e+00
25% 7.308143e+09 5.900000e+03 2008.000000 3.770400e+04
50% 7.312621e+09 1.395000e+04 2013.000000 8.554800e+04
75% 7.315254e+09 2.648575e+04 2017.000000 1.335425e+05
max 7.317101e+09 3.736929e+09 2022.000000 1.000000e+07
In [9]:
## The vehicle dataset consists of the attributes listed below. Computing the percentage of missing
## data per column reveals several columns with a very high share of missing values.

# Print the dataset information
print(df.info())

# Count missing values per attribute and express them as a percentage of all rows
md = pd.DataFrame(df.isnull().sum(), columns=['count']).sort_values(by='count', ascending=False)
md['percent'] = round(md['count'] / len(df) * 100, 1)
md = md.reset_index().rename(columns={'index': 'Car Attributes'})

# Plot the missing attributes in the dataset
fig = px.bar(md, x='Car Attributes', y='percent', text='percent', color='Car Attributes', title='% of missing data per attribute')
fig.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 426880 entries, 0 to 426879
Data columns (total 18 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   id            426880 non-null  int64  
 1   region        426880 non-null  object 
 2   price         426880 non-null  int64  
 3   year          425675 non-null  float64
 4   manufacturer  409234 non-null  object 
 5   model         421603 non-null  object 
 6   condition     252776 non-null  object 
 7   cylinders     249202 non-null  object 
 8   fuel          423867 non-null  object 
 9   odometer      422480 non-null  float64
 10  title_status  418638 non-null  object 
 11  transmission  424324 non-null  object 
 12  VIN           265838 non-null  object 
 13  drive         296313 non-null  object 
 14  size          120519 non-null  object 
 15  type          334022 non-null  object 
 16  paint_color   296677 non-null  object 
 17  state         426880 non-null  object 
dtypes: float64(2), int64(2), object(14)
memory usage: 58.6+ MB
None
In [10]:
## Compute the cumulative count of missing values per row. This helps identify rows where most of the data is missing, which can be dropped.
me = pd.DataFrame((df.isnull().sum(axis=1)),columns=['count']) 
me = me.groupby(['count'])[['count']].count()
me['CV'] = me['count'].cumsum()
print(me)
## Print the Bar chart
fig = px.bar(me, y='CV', text='CV', color='count', title='Cumulative count of missing data per row (e.g. rows with no missing values = 34868)')
fig.show()
       count      CV
count               
0      34868   34868
1      78259  113127
2      99854  212981
3      91251  304232
4      46728  350960
5      21196  372156
6      19246  391402
7      30443  421845
8       4279  426124
9        125  426249
10       539  426788
11        24  426812
14        68  426880

Conclusion of Data Analysis¶

  1. The "size" column is missing 71.8% of its data, hence the column can be dropped.
  2. Drop the "id" column, as it carries no significance for the model or the customer.
  3. The "cylinders" column adds little value because roughly 50% of its data is missing or marked unknown; hence this column can be dropped.
  4. The "condition" column can be significant to customers, so its missing data will be imputed.
  5. Drop the "VIN" column, as it does not reflect customer preferences.
  6. As the chart shows, some rows are missing many data points; rows with more than 5 NaN values will be dropped.
  7. The remaining columns will be kept, and a feature selection model will then identify the key features.
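The row-dropping rule in item 6 can be sketched on a toy frame. Note that `thresh` in pandas `DataFrame.dropna` is the minimum number of *non-null* values a row must have to be kept, so "drop rows with more than 5 NaNs" translates to `thresh = n_columns - 5` (the toy frame and its threshold below are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy frame with 7 columns; drop rows with more than 2 NaNs,
# i.e. keep rows with at least 7 - 2 = 5 non-null values.
toy = pd.DataFrame({
    'a': [1, np.nan, 3],
    'b': [1, np.nan, 3],
    'c': [1, np.nan, 3],
    'd': [1, 2, np.nan],
    'e': [1, 2, np.nan],
    'f': [1, 2, 3],
    'g': [1, 2, 3],
})
max_missing = 2
kept = toy.dropna(thresh=len(toy.columns) - max_missing)
print(kept.index.tolist())  # [0, 2] -- row 1 has 3 NaNs and is dropped
```

The same pattern, with `max_missing = 5`, is applied to the real dataset in the data preparation step.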

Data Preparation¶

After our initial exploration and fine-tuning of the business understanding, it is time to construct our final dataset before modeling. Here, we want to handle any integrity issues and cleaning, engineer new features, apply any transformations we believe are needed (scaling, logarithms, normalization, etc.), and generally prepare for modeling with sklearn.

The following data cleanup is performed, based on the preceding analysis, to prepare the data for modelling:

  1. Drop the identified columns with a high percentage of missing data
  2. Drop the rows having more than 5 NaN columns
  3. Remove junk characters from the dataset
  4. Use imputation techniques to fill in missing data for both numeric and categorical fields
  5. Encode the data so that it can be fed into various models
  6. Scale the dataset and split it into train and test sets
  7. Analyze the correlation between features
  8. Run PCA to analyze the dataset's dimensionality
  9. Produce the final dataset for modelling
In [13]:
## Data cleanup: drop the unwanted columns, then delete the rows with more than 5 missing values.
## The final chart shows the count of missing values per remaining column; this missing data will be imputed.

# Drop the columns 'size', 'id', 'cylinders', 'VIN': they have too much missing data or add little relevance to the model.
new_df = df.drop(['size','id','cylinders','VIN'], axis=1)

# Delete rows with more than 5 columns missing
new_df.dropna(thresh=len(new_df.columns) - 5,axis=0, inplace=True)

# Replace junk chars from the model column
pattern = r'[^a-zA-Z0-9\s]'
new_df['model'] = new_df['model'].replace(pattern, '', regex=True)

# Count of missing data per attribute
mg = pd.DataFrame(new_df.isnull().sum(), columns=['count']).sort_values(by='count', ascending=False)
display(mg)

# Plot the count of missing data by feature
fig = px.bar(mg, x='count', text= 'count', color='count', width=1000, height=600 , title = 'Count of missing data per feature')
fig.show()
count
condition 173269
drive 129733
paint_color 129410
type 92033
manufacturer 17427
title_status 7424
model 5209
odometer 3759
fuel 2186
transmission 1868
year 1128
region 0
price 0
state 0
In [14]:
## Data Imputation:
## Use SimpleImputer to fill in the missing data ##
X = df_imputed = new_df.copy()

impc = ['condition','drive','year','paint_color','type','manufacturer','title_status','model','fuel','transmission','region','state']
impn = ['odometer','price']
imputerc = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
imputern = SimpleImputer(missing_values=np.nan, strategy='mean')

# Use the imputers to fill the null values in the specified columns
X[impc] = imputerc.fit_transform(X[impc])
X[impn] = imputern.fit_transform(X[impn])
X.isnull().sum()
cat_selector = make_column_selector(dtype_include=object)

## Cardinality of categorical columns in the data frame:
Ca = pd.DataFrame(X[cat_selector(X)].nunique(), columns=['count'])
print(Ca)
              count
region          404
year            114
manufacturer     42
model         28946
condition         6
fuel              5
title_status      6
transmission      3
drive             3
type             13
paint_color      12
state            51
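The cardinality counts above drive the encoding choice later: high-cardinality columns get target (mean) encoding, low-cardinality columns get frequency encoding. A toy split by `nunique` illustrates the idea (the threshold and column values here are illustrative, not the ones used on the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['az', 'ca', 'az'],           # low cardinality: 2 unique values
    'model': ['f150', 'civic', 'tundra'],  # high cardinality: every row distinct
})

threshold = 2  # illustrative cut-off
high_card = [c for c in toy.columns if toy[c].nunique() > threshold]
low_card = [c for c in toy.columns if toy[c].nunique() <= threshold]
print(high_card, low_card)  # ['model'] ['state']
```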
In [15]:
## Create a new copy of the dataset
X = df_imputed = new_df.copy()

## Select the numerical and categorical features from the dataset 

cat_selector = make_column_selector(dtype_include=object)
numerical_features = num_selector = make_column_selector(dtype_include=np.number)

## Pipeline for categorical imputation (most-frequent strategy)
cat_linear_processor = make_pipeline( 
    SimpleImputer(strategy='most_frequent')
)

## Pipeline for numerical imputation (mean strategy)
num_linear_processor = make_pipeline(
#    StandardScaler(), 
    SimpleImputer(strategy='mean')
)

## Data Imputer Steps ##

dataImputer = ColumnTransformer(transformers=[
    ('numImputer(Strategy = Mean)', num_linear_processor, num_selector),
    ('catImputer(Strategy = Most Frequent)', cat_linear_processor, cat_selector),
], remainder='passthrough')

## Fit the data imputer
X = pd.DataFrame(dataImputer.fit_transform(X), columns=['price', 'year', 'odometer','region', 'manufacturer', 'model', 'condition', 'fuel', 'title_status', 'transmission', 'drive', 'type', 'paint_color', 'state'])

## The X dataframe now holds the imputed values
print(X.isnull().sum())
print(' ')
dataImputer
price           0
year            0
odometer        0
region          0
manufacturer    0
model           0
condition       0
fuel            0
title_status    0
transmission    0
drive           0
type            0
paint_color     0
state           0
dtype: int64
 
Out[15]:
ColumnTransformer(remainder='passthrough',
                  transformers=[('numImputer(Strategy = Mean)',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer())]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x00000257F6211850>),
                                ('catImputer(Strategy = Most Frequent)',
                                 Pipeline(steps=[('simpleimputer',
                                                  SimpleImputer(strategy='most_frequent'))]),
                                 <sklearn.compose._column_transformer.make_column_selector object at 0x00000257816E0C50>)])
In [16]:
## Function for Target encoder
def highCardinality_encode_column(dfm, column):
    encoder = TargetEncoder()
    return encoder.fit_transform(dfm[[column]], dfm['price'])

# List of columns to encode
high_cardinality_features = ['region', 'manufacturer', 'condition','fuel', 'title_status', 'drive', 'state']
low_cardinality_features  = ['transmission', 'type', 'paint_color']
encoded_results = {}

# Use ThreadPoolExecutor for parallel processing
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(highCardinality_encode_column, X, col): col for col in high_cardinality_features}
    for future in futures:  
        col = futures[future]
        encoded_results[col] = future.result()
        
# Combine encoded results into the original DataFrame
for col, encoded in encoded_results.items():
    X[col + '_encoded'] = encoded
## The Target encoder function had an issue encoding the model column, hence it is mean-encoded manually.
mean_encoded = X.groupby('model')['price'].mean()
X['model_encoded'] = X['model'].map(mean_encoded)

## Encode the low-cardinality features using frequency counts
for col in low_cardinality_features:
    frequency_encoded = X[col].value_counts()
    X[col + '_encoded'] = X[col].map(frequency_encoded)

# Display the DataFrame with the encoded features
X.head()
Out[16]:
price year odometer region manufacturer model condition fuel title_status transmission ... manufacturer_encoded condition_encoded fuel_encoded title_status_encoded drive_encoded state_encoded model_encoded transmission_encoded type_encoded paint_color_encoded
0 33590.0 2014.0 57923.0 auburn gmc sierra 1500 crew cab slt good gas clean other ... 30426.023100 71012.649463 73584.761409 77315.43855 107316.446103 239642.53219 35224.934498 62672 43510 208693
1 22590.0 2010.0 71229.0 auburn chevrolet silverado 1500 good gas clean other ... 115820.592547 71012.649463 73584.761409 77315.43855 107316.446103 239642.53219 20619.683389 62672 43510 31223
2 39590.0 2020.0 19160.0 auburn chevrolet silverado 1500 crew good gas clean other ... 115820.592547 71012.649463 73584.761409 77315.43855 107316.446103 239642.53219 34064.27593 62672 43510 30460
3 30990.0 2017.0 41124.0 auburn toyota tundra double cab sr good gas clean other ... 235060.628139 71012.649463 73584.761409 77315.43855 107316.446103 239642.53219 34749.481707 62672 43510 30460
4 15000.0 2013.0 128000.0 auburn ford f150 xlt excellent gas clean automatic ... 35929.010235 51346.825953 73584.761409 77315.43855 40796.308366 239642.53219 18396.925068 338265 35279 62859

5 rows × 25 columns
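The manual mean (target) encoding applied to `model` above can be illustrated on a toy frame: each category is replaced by the mean of `price` within that category (the models and prices below are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'model': ['f150', 'civic', 'f150', 'civic'],
    'price': [30000, 10000, 20000, 12000],
})

# Map each model to the average price observed for it
mean_encoded = toy.groupby('model')['price'].mean()
toy['model_encoded'] = toy['model'].map(mean_encoded)
print(toy['model_encoded'].tolist())  # [25000.0, 11000.0, 25000.0, 11000.0]
```

Note that because the encoding uses the target itself, in a production setting it should be fit on the training split only to avoid leaking price information into the features.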

In [17]:
#### Create train and test datasets for the model ##### 
X_transformed = X[['price','year','odometer','region_encoded', 'manufacturer_encoded', 'condition_encoded','fuel_encoded', 'title_status_encoded', 'drive_encoded', 'state_encoded','transmission_encoded', 'type_encoded', 'paint_color_encoded']]

# Scale the dataset (z-score standardization)
X_scaled = (X_transformed - X_transformed.mean()) / X_transformed.std()

# Separate the features and the target
X_t = X_scaled.drop(columns = 'price') 
y_t = X_scaled['price']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_t, y_t, test_size = 0.3, random_state = 42)
X_train.head()
Out[17]:
year odometer region_encoded manufacturer_encoded condition_encoded fuel_encoded title_status_encoded drive_encoded state_encoded transmission_encoded type_encoded paint_color_encoded
158946 0.187207 0.082042 -0.197052 -0.529538 -0.049189 -0.126011 0.180677 0.779027 -0.339408 0.508288 1.127212 1.008512
69643 0.61074 -0.223239 -0.168788 0.386326 -0.049189 -0.613784 0.180677 0.779027 0.257239 0.508288 -0.804167 1.008512
33466 -0.024559 0.384802 -0.198494 -0.529538 -0.273124 -0.126011 0.180677 0.779027 0.257239 0.508288 1.127212 -0.689043
311528 -0.236326 0.107359 -0.186722 4.355051 -0.049189 3.089299 0.180677 0.779027 0.914641 0.508288 -0.323078 1.008512
176464 0.187207 0.084479 -0.203122 0.720762 -0.273124 -0.126011 0.180677 0.779027 -0.329358 -2.189783 -0.323078 -1.343997
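The scaling step above is the usual z-score, x' = (x - mean) / std. A quick sketch on a toy series (pandas `std` uses the sample standard deviation, ddof=1):

```python
import pandas as pd

s = pd.Series([2.0, 4.0, 6.0])
z = (s - s.mean()) / s.std()  # mean = 4.0, sample std (ddof=1) = 2.0
print(z.tolist())             # [-1.0, 0.0, 1.0]
```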
In [18]:
## Plot the correlation matrix of the feature columns.

corr = X_train.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corr,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values,
            cmap="coolwarm",
            vmin=-1,
            vmax=1,
            annot=True,
           fmt='.2f')
plt.title("Correlation Heatmap of vehicle dataset")
plt.show()
corr.head(200)
[Figure: Correlation heatmap of the vehicle dataset]
Out[18]:
year odometer region_encoded manufacturer_encoded condition_encoded fuel_encoded title_status_encoded drive_encoded state_encoded transmission_encoded type_encoded paint_color_encoded
year 1.000000 -0.155131 -0.017714 -0.018380 -0.174635 -0.040043 0.030979 0.019509 -0.008890 -0.009237 -0.054727 0.042891
odometer -0.155131 1.000000 0.014160 0.001405 0.068618 0.051401 -0.011579 0.018927 0.003487 0.067937 0.037033 0.006945
region_encoded -0.017714 0.014160 1.000000 -0.000704 0.001167 0.010083 0.006737 -0.004000 0.582148 0.004632 0.012009 0.003770
manufacturer_encoded -0.018380 0.001405 -0.000704 1.000000 -0.006395 -0.063489 0.014712 0.046917 0.002449 0.037917 0.004149 -0.017659
condition_encoded -0.174635 0.068618 0.001167 -0.006395 1.000000 0.014655 -0.026491 -0.015792 0.003404 -0.025620 -0.008420 -0.031176
fuel_encoded -0.040043 0.051401 0.010083 -0.063489 0.014655 1.000000 0.005178 0.147375 0.011325 0.082067 -0.040669 0.085349
title_status_encoded 0.030979 -0.011579 0.006737 0.014712 -0.026491 0.005178 1.000000 0.015577 0.007637 -0.034818 -0.039897 0.023134
drive_encoded 0.019509 0.018927 -0.004000 0.046917 -0.015792 0.147375 0.015577 1.000000 -0.009967 0.027020 -0.035340 0.178237
state_encoded -0.008890 0.003487 0.582148 0.002449 0.003404 0.011325 0.007637 -0.009967 1.000000 -0.002688 0.012196 0.011269
transmission_encoded -0.009237 0.067937 0.004632 0.037917 -0.025620 0.082067 -0.034818 0.027020 -0.002688 1.000000 0.149585 0.048032
type_encoded -0.054727 0.037033 0.012009 0.004149 -0.008420 -0.040669 -0.039897 -0.035340 0.012196 0.149585 1.000000 0.154249
paint_color_encoded 0.042891 0.006945 0.003770 -0.017659 -0.031176 0.085349 0.023134 0.178237 0.011269 0.048032 0.154249 1.000000
In [19]:
## Run PCA on the dataframe and plot the number of components against the explained variance ratio 

# Component counts to evaluate (0 through 11, for the 12 features in the dataframe)
iterates = np.arange(12)
var_ratio = []

# For each component count, fit PCA and store the total explained variance ratio
for iterate in iterates:
  pca = PCA(n_components=iterate)
  pca.fit(X_t)
  var_ratio.append(np.sum(pca.explained_variance_ratio_))

#Plot the figure 
plt.figure(figsize=(10,8),dpi=150)
plt.grid()
plt.plot(iterates,var_ratio,marker='o')
plt.xlabel('n_components')
plt.ylabel('Explained variance ratio')
plt.title('n_components vs. Explained Variance Ratio')
Out[19]:
Text(0.5, 1.0, 'n_components vs. Explained Variance Ratio')
[Figure: n_components vs. explained variance ratio]
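As a sanity check on the loop above: the sum of `explained_variance_ratio_` is non-decreasing in the number of components and reaches 1.0 when all components are kept. A minimal sketch on random toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 4))  # 100 samples, 4 features

# Total explained variance ratio for 1..4 components
ratios = [PCA(n_components=k).fit(X_toy).explained_variance_ratio_.sum()
          for k in range(1, 5)]
print([round(r, 3) for r in ratios])
```

With all 4 components the ratio is 1.0; a curve that climbs only gradually, as in the plot above, indicates that variance is spread across many components and no small subset dominates.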

Data preparation conclusion¶

  1. Improved the data quality of the vehicle dataset by:
     a. Removing columns and rows with unwanted or largely missing data
     b. Imputing the missing data using the mean and most-frequent strategies
     c. Removing junk characters from the columns
     d. Scaling the dataset

  2. Generated a correlation heatmap and ran PCA on all the features:
     a. The results show that the features are not closely related, except for state and region.
     b. The explained variance curve converges only gradually, therefore all features need to be examined in the model.

  3. The final dataset is ready for Model execution.

Modeling¶

With your (almost?) final dataset in hand, it is now time to build some models. Here, you should build a number of different regression models with the price as the target. In building your models, you should explore different parameters and be sure to cross-validate your findings.

Generated the following models to evaluate the vehicle dataframe:

  1. Linear regression with sequential feature selection
  2. Ridge regression with polynomial features
  3. Linear regression with polynomial features and Lasso-based feature selection
  4. Lasso regression with polynomial features

Calculated the MSE on the train and test datasets for comparison.
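The comparison metric, mean squared error, is just the average of the squared residuals; a quick hand check against `sklearn.metrics.mean_squared_error` on toy values:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 0.0, 2.0])
y_pred = np.array([2.0, 0.0, 4.0])

# MSE = mean((y_true - y_pred)^2) = (1 + 0 + 4) / 3
mse = mean_squared_error(y_true, y_pred)
print(round(mse, 4))  # 1.6667
```

Since the target here is the standardized price, the MSE values below are in units of squared standard deviations, not dollars.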

In [22]:
#### Model_1 : Linear Regression with Sequential Feature Selection ####

# Selector pipeline to run the selector & Linear Regression
selector_pipe = Pipeline([('selector', SequentialFeatureSelector(LinearRegression())),
                         ('model', LinearRegression())])
# Print Pipeline steps 

selector_pipe
Out[22]:
Pipeline(steps=[('selector',
                 SequentialFeatureSelector(estimator=LinearRegression())),
                ('model', LinearRegression())])
In [23]:
#### Execute Model_1 and extract the mean squared error for the train and test datasets ####

# Param grid to iterate over several feature counts and select the best using grid search cross-validation
param_dict = {'selector__n_features_to_select': [2, 4, 6, 10]}
selector_grid = GridSearchCV(selector_pipe, param_grid=param_dict)

# Fit, then predict on the train and test datasets:
selector_grid.fit(X_train, y_train)
train_preds = selector_grid.predict(X_train)
test_preds = selector_grid.predict(X_test)

# Calculate the MSE for the train and test datasets:
selector_train_mse = mean_squared_error(y_train, train_preds)
selector_test_mse = mean_squared_error(y_test, test_preds)

# Print the values:
print(f'Train MSE: {selector_train_mse}')
print(f'Test MSE: {selector_test_mse}')
Train MSE: 1.160192176358472
Test MSE: 0.6242452522600143
In [24]:
#### Model_2 Ridge regression with polynomial features #### 

ridge_param_dict = {'ridge__alpha': np.logspace(0, 10, 50)}

# Prepare the ridge pipeline with polynomial features
ridge_pipe = Pipeline([
                      ('poly_features', PolynomialFeatures(degree = 3, include_bias = False)),
                       ('ridge', Ridge())])

# Print the pipeline steps
print('## RIDGE REGRESSION MODEL ##')
ridge_pipe
## RIDGE REGRESSION MODEL ##
Out[24]:
Pipeline(steps=[('poly_features',
                 PolynomialFeatures(degree=3, include_bias=False)),
                ('ridge', Ridge())])
In [25]:
## Tune alpha using grid search cross-validation
ridge_grid = GridSearchCV(ridge_pipe, param_grid=ridge_param_dict)

# Fit the pipeline, predict using Test & Train datasets and calculate the MSE errors
ridge_grid.fit(X_train, y_train)
ridge_train_preds = ridge_grid.predict(X_train)
ridge_test_preds = ridge_grid.predict(X_test)

# Calculate the mean squared errors
ridge_train_mse = mean_squared_error(y_train, ridge_train_preds)
ridge_test_mse = mean_squared_error(y_test, ridge_test_preds)

# Print the MSE values for Test and Train 
print(f'Train MSE: {ridge_train_mse}')
print(f'Test MSE: {ridge_test_mse}')
Train MSE: 1.1599686178671338
Test MSE: 0.6245355741141243
In [26]:
#### MODEL_3 Linear regression with Lasso-based feature selection #### 

model_selector_pipe = Pipeline([('poly_features', PolynomialFeatures(degree = 3, include_bias = False)),
                                ('selector', SelectFromModel(Lasso())),
                                ('linreg', LinearRegression())])

# Print the model
print('## LINEAR REGRESSION WITH LASSO FEATURE SELECTION ##')
model_selector_pipe
## LINEAR REGRESSION WITH LASSO FEATURE SELECTION ##
Out[26]:
Pipeline(steps=[('poly_features',
                 PolynomialFeatures(degree=3, include_bias=False)),
                ('selector', SelectFromModel(estimator=Lasso())),
                ('linreg', LinearRegression())])
In [27]:
#### Model Execution #### 

# Fit the pipeline and calculate the mean squared error on the train and test splits

fx = model_selector_pipe.fit(X_train, y_train)
selector_train_mse = mean_squared_error(y_train, model_selector_pipe.predict(X_train))
selector_test_mse = mean_squared_error(y_test, model_selector_pipe.predict(X_test))

# Print the train and test MSEs
print(selector_train_mse)
print(selector_test_mse)
1.153049416785618
0.6285839373470016
In [28]:
## Identify the selected features by the model and get the feature coefficients 

selector = model_selector_pipe.named_steps['selector']

# Get the mask of selected features
selected_features_mask = selector.get_support()

# Get selected feature names from the polynomial features
poly_features = model_selector_pipe.named_steps['poly_features']
selected_feature_names = pd.DataFrame(poly_features.get_feature_names_out()[selected_features_mask])
selected_feature_names.columns=['Feature Name']

# Get the feature coefficients  
coef= model_selector_pipe.named_steps['selector'].estimator_.coef_
coef_sel = pd.DataFrame(coef[selected_features_mask])
coef_sel.columns =['Coefficient']

fs = pd.concat((selected_feature_names, coef_sel), axis=1)
print(fs)
                           Feature Name  Coefficient
0                       year odometer^2    -0.000054
1                      region_encoded^3     0.000093
2  manufacturer_encoded state_encoded^2     0.000951
3                       state_encoded^3     0.000034
In [29]:
#### Model_4 Lasso regression model #### 

# Create a pipeline with polynomial features and Lasso regression  

auto_pipe = Pipeline([
                        ('polyfeatures', PolynomialFeatures(degree = 3, include_bias = False)),
#                        ('selector', SequentialFeatureSelector(LinearRegression(), n_features_to_select='auto')) ,
                        ('lasso', Lasso(random_state = 42))
                       ])
## Print the model steps
auto_pipe
Out[29]:
Pipeline(steps=[('polyfeatures',
                 PolynomialFeatures(degree=3, include_bias=False)),
                ('lasso', Lasso(random_state=42))])
In [30]:
## Model_4 Lasso regression execution

# Fit the model and extract the Lasso coefficients
auto_pipe.fit(X_train, y_train)
lasso_coefs = auto_pipe.named_steps['lasso'].coef_

# Calculate the train and test MSEs 
lasso_train_mse = mean_squared_error(y_train, auto_pipe.predict(X_train))
lasso_test_mse = mean_squared_error(y_test, auto_pipe.predict(X_test))

# Print the Lasso MSEs
print(lasso_train_mse)
print(lasso_test_mse)

# Feature names retained by the model and their Lasso coefficients 
feature_names = auto_pipe.named_steps['polyfeatures'].get_feature_names_out()
lasso_df = pd.DataFrame({'feature': feature_names, 'coef': lasso_coefs})

# Print Feature names
print(type(feature_names))
lasso_df.loc[lasso_df['coef'] != 0]
1.158084602773126
0.6246889064643456
<class 'numpy.ndarray'>
Out[30]:
feature coef
102 year odometer^2 -0.000054
168 odometer^3 0.000004
171 odometer^2 condition_encoded -0.000002
234 region_encoded^3 0.000093
324 manufacturer_encoded state_encoded^2 0.000951
434 state_encoded^3 0.000034
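The Lasso coefficients above are all tiny in magnitude, which often happens when features on very different scales (e.g. odometer vs. encoded categories) are expanded polynomially without standardization. A minimal sketch on synthetic data (not the notebook's dataset) showing how inserting a `StandardScaler` after the polynomial expansion can make Lasso coefficients more comparable:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Lasso

rng = np.random.default_rng(42)
# Three features on very different scales, like year vs. odometer
X = rng.normal(size=(200, 3)) * np.array([1, 1000, 100000])
y = X[:, 0] * 2 + X[:, 1] * 0.001 + rng.normal(size=200)

pipe = Pipeline([
    ('polyfeatures', PolynomialFeatures(degree=2, include_bias=False)),
    ('scaler', StandardScaler()),           # standardize the expanded features
    ('lasso', Lasso(alpha=0.1, random_state=42)),
])
pipe.fit(X, y)
coefs = pipe.named_steps['lasso'].coef_
print('non-zero coefficients:', (coefs != 0).sum(), 'of', coefs.size)
```

With scaled inputs, the coefficient magnitudes reflect relative importance rather than raw feature scale, which simplifies the interpretation step below.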
In [31]:
## Find the best model and number of features selected by the Sequential selector along with features coefficients

best_estimator = selector_grid.best_estimator_
best_selector = best_estimator.named_steps['selector']
best_model = selector_grid.best_estimator_.named_steps['model']
feature_names = X_train.columns[best_selector.get_support()]
coefs = best_model.coef_

# Print best estimator
print(best_estimator)
print(f'Features from best selector: {feature_names}.')
print('Coefficient values: ')
print('===================')
pd.DataFrame([coefs.T], columns = feature_names, index = ['model'])
Pipeline(steps=[('selector',
                 SequentialFeatureSelector(estimator=LinearRegression(),
                                           n_features_to_select=2)),
                ('model', LinearRegression())])
Features from best selector: Index(['region_encoded', 'fuel_encoded'], dtype='object').
Coefficient values: 
===================
Out[31]:
region_encoded fuel_encoded
model 0.029021 0.001582
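The `selector_grid` object used above is defined earlier in the notebook. For reference, a hedged sketch (synthetic data; all names are assumptions) of how such a grid search over `n_features_to_select` for a `SequentialFeatureSelector` pipeline might be built:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X[:, 0] * 3 - X[:, 2] * 2 + rng.normal(size=150)

# Pipeline: sequential forward selection feeding a linear model
pipe = Pipeline([
    ('selector', SequentialFeatureSelector(LinearRegression())),
    ('model', LinearRegression()),
])
param_grid = {'selector__n_features_to_select': [1, 2, 3]}
selector_grid = GridSearchCV(pipe, param_grid=param_grid,
                             scoring='neg_mean_squared_error')
selector_grid.fit(X, y)
print(selector_grid.best_params_)
```

The grid search picks the number of retained features by cross-validated MSE, which is how the best estimator inspected above was obtained.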

Evaluation¶

With some modeling accomplished, we aim to reflect on what we identify as a high quality model and what we are able to learn from this. We should review our business objective and explore how well we can provide meaningful insight on drivers of used car prices. Your goal now is to distill your findings and determine whether the earlier phases need to be revisited and adjusted, or whether you have information of value to bring back to your client.
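The model comparison below relies on a single train/test split, which can be brittle. A minimal sketch (synthetic data, not the notebook's dataset) of comparing the same candidate estimators with k-fold cross-validation instead:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.5, -2.0, 0.0, 0.5]) + rng.normal(size=300)

models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
}
# Mean MSE over 5 folds (cross_val_score returns negated MSE)
cv_mse = {name: -cross_val_score(m, X, y, cv=5,
                                 scoring='neg_mean_squared_error').mean()
          for name, m in models.items()}
for name, mse in cv_mse.items():
    print(f'{name}: {mse:.3f}')
```

Averaging the error across folds gives a steadier ranking of the models than one split, which strengthens whatever recommendation is taken back to the client.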

In [33]:
## MSE comparison between various models Train/ Test 
## 
MSE = {
     'Model':     ['Linear Regression', 'Ridge Regression', 'Linear Regression(Lasso selector)', 'Lasso Regression'], 
     'Train MSE': [1.160192176358472, 1.1602148172330142, 1.153049416785618, 1.158084602773126], 
     'Test MSE':  [0.6242452522600143, 0.6241035347444294, 0.6285839373470016, 0.6246889064643456]
}
MSED = pd.DataFrame(MSE)


fig = px.scatter(MSED, x='Model', y='Train MSE', color='Model', size='Train MSE', title = 'Mean Squared Error for Model based on Train dataset')
fig1 = px.scatter(MSED, x='Model', y='Test MSE', color='Model',size='Test MSE', title = 'Mean Squared Error for Model based on Test dataset')
fig.show()
fig1.show()
MSED.head(4)
Out[33]:
Model Train MSE Test MSE
0 Linear Regression 1.160192 0.624245
1 Ridge Regression 1.160215 0.624104
2 Linear Regression(Lasso selector) 1.153049 0.628584
3 Lasso Regression 1.158085 0.624689
In [73]:
## The error comparison suggests Linear Regression with the Lasso selector is a good model on both the train and test datasets

## Plot the selected features & their importance coefficients 

fig = px.bar(fs,x='Feature Name', y='Coefficient', color= 'Feature Name', 
              width=800,
              height=600,
             title = 'Feature importance based on coefficients')
fig.show()

Conclusion¶

Evaluating the MSE on the scaled training and test data, the Linear Regression model with the Lasso selector and polynomial features gives better results than the other models. We therefore select this model for the final car price evaluation and present it to the car dealers.

Deployment¶

Now that we've settled on our models and findings, it is time to deliver the information to the client. You should organize your work as a basic report that details your primary findings. Keep in mind that your audience is a group of used car dealers interested in fine tuning their inventory.

The key features dealers should focus on to obtain better prices for a car are:

  1. Car Manufacturer
  2. Region
  3. State
  4. Year of manufacture
  5. Low Odometer reading

Other attributes to consider are:

  1. Fuel Efficiency
  2. Condition of Car

Car price depends directly on the manufacturer and the state in which the car is sold. The used car inventory can be organized around these manufacturers.

Also, a low odometer reading and a recent year of manufacture are related, and both are associated with a higher car price.
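Once a final model is chosen, one common way for the dealership to reuse it is to persist the fitted pipeline and load it when pricing new listings. A hedged sketch on synthetic stand-in data (file name, features, and `joblib` persistence are assumptions, not the notebook's actual deployment):

```python
import numpy as np
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))          # stand-ins for year, odometer, etc.
y = X[:, 0] * 2 - X[:, 1] + rng.normal(size=200)

final_pipe = Pipeline([
    ('polyfeatures', PolynomialFeatures(degree=2, include_bias=False)),
    ('lasso', Lasso(alpha=0.01, random_state=42)),
])
final_pipe.fit(X, y)

# Persist the fitted pipeline, then reload it to price a new listing
joblib.dump(final_pipe, 'car_price_model.joblib')
loaded = joblib.load('car_price_model.joblib')
new_car = rng.normal(size=(1, 3))
print('predicted (scaled) price:', loaded.predict(new_car)[0])
```

Shipping the whole pipeline (rather than just the estimator) keeps the polynomial expansion and the model together, so dealers' staff only need to supply raw feature values.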